Abstract. We study a new variant of the string matching problem called cross-document
string matching, which is the problem of indexing a collection of documents to support an
efficient search for a pattern in a selected document, where the pattern itself is a substring of
another document. Several variants of this problem are considered, and efficient linear-space
solutions are proposed with query time bounds that either do not depend at all on the pattern
size or depend on it in a very limited way (doubly logarithmic). As a side result, we propose
an improved solution to the weighted level ancestor problem.

1 Introduction

In this paper we study the following variant of the string matching problem that we
call cross-document string matching: given a collection of strings (documents) stored in
a "database", we want to be able to efficiently search for a pattern in a given document,
where the pattern itself is a substring of another document. More formally, assuming we
have a set of documents $T_1, \ldots, T_m$, we want to answer queries about the occurrences of a
substring $T_k[i..j]$ in a document $T_\ell$.

This scenario may occur in various practical situations when we have to search for a
pattern in a text stored in a database, and the pattern is itself drawn from a string from
the same database. This is a common situation in bioinformatics, where one may want
to repeatedly look for genomic elements drawn from a genome within a set of genomic
sequences involved in the project. In bibliographic search, it is common to look up words
or citations coming from one document in other documents. Similar scenarios may occur
in other traditional applications of string matching, such as the analysis of web server
log files.

In this paper, we study different versions of the cross-document string matching problem. 
First, we distinguish between counting and reporting queries, asking respectively about 
the number of occurrences of $T_k[i..j]$ in $T_\ell$ or about the occurrences themselves. The two
query types lead to slightly different solutions. In particular, the counting problem uses the 
weighted level ancestor problem [10, 1] to which we propose a new solution with an improved 
complexity bound. 

We further consider different variants of the two problems. The first one is the dynamic
version where new documents can be added to the database. In another variant, called
document counting and reporting, we only need to respectively count or report the documents
containing the pattern, rather than counting or reporting pattern occurrences within a given
document. This version is very close to the document retrieval problem previously studied
(see [15] and later papers referring to it), with the difference that in our case the pattern is
itself selected from the documents stored in the database. Finally, we also consider succinct
data structures for the above problems, where we keep the underlying index data structures in
compressed form.

Let m be the number of stored strings and n the total length of all strings. Our results 
are summarized below: 

(i) for the counting problem, we propose a solution with query time $O(t + \log\log m)$, where
$t = \min(\sqrt{\log occ/\log\log occ}, \log\log |P|)$, $P = T_k[i..j]$ is the searched substring and $occ$
is the number of its occurrences in $T_\ell$,

(ii) for the reporting problem, our solution outputs all the occurrences in time $O(\log\log m + occ)$,

(iii) in the dynamic case, when new documents can be dynamically added to the database,
we are able to answer counting queries in time $O(\log n)$ and reporting queries in time
$O(\log n + occ)$, whereas the updates take time $O(\log n)$ per character,

(iv) for the document counting and document reporting problems, our algorithms run in time
$O(\log n)$ and $O(t + ndocs)$ respectively, where $t$ is as above and $ndocs$ is the number of
reported documents,

(v) finally, we also present succinct data structures that support counting, reporting, and
document reporting queries in the cross-document scenario (see Theorems 6 and 7 in
Section 4.3).

For problems (i)-(iv), the involved data structures occupy $O(n)$ space. Interestingly, in the
cross-document scenario, the query times either do not depend at all on the pattern length 
or depend on it in a very limited (doubly logarithmic) way. 

Throughout the paper positions in strings are numbered from 1. Notation T[i..j] stands 
for the subword T[i]T[i + 1] . . . T[j] of T, and T[i..] denotes the suffix of T starting at 
position i. 

2 Preliminaries 

2.1 Basic Data Structures

We assume a basic knowledge of suffix trees and suffix arrays. 

Besides using suffix trees for individual strings $T_i$, we will also be using the generalized
suffix tree for a set of strings $T_1, T_2, \ldots, T_m$ that can be viewed as the suffix tree for the
string $T_1\$_1T_2\$_2 \cdots T_m\$_m$. A leaf in a suffix tree for $T_i$ is associated with a distinct suffix of
$T_i$, and a leaf in the generalized suffix tree is associated with a suffix of some document $T_i$
together with the index $i$ of this document. We assume that for each node $v$ of a suffix tree,
the number $n_v$ of leaves in the subtree rooted at $v$, as well as its string depth $d(v)$, can be
recovered in constant time. Recall that the string depth $d(v)$ is the total length of strings
labelling the edges along the path from the root to $v$.



We will also use the suffix arrays for individual documents as well as the generalized
suffix array for strings $T_1, T_2, \ldots, T_m$. Each entry of the suffix array for $T_i$ is associated
with a distinct suffix of $T_i$, and each entry of the generalized suffix array for $T_1, \ldots, T_m$ is
associated with a suffix of some document $T_i$ and the index $i$ of the document the suffix
comes from. We store these document indices in a separate array $D$, called the document array,
such that $D[i] = k$ if the $i$-th entry of the generalized suffix array for $T_1, \ldots, T_m$ corresponds
to a suffix coming from $T_k$.

For each considered suffix array, we assume available, when needed, two auxiliary arrays: 
an inverted suffix array and another array, called the LCP-array, of longest common prefixes 
between each suffix and the preceding one in the lexicographic order. 

Suffix trees and suffix arrays are naturally related: if the children of any internal node of a 
suffix tree are ordered in the lexicographic order of the labels (actually, of the first symbols 
of the labels, as they are all distinct), then the leaves ordered "left-to-right" correspond 
exactly to the suffix array with respect to the referred suffixes. 
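
The structures of this section can be illustrated with a minimal Python sketch. It is only a naive construction for small inputs, not the linear-time construction assumed in the paper; the function names are hypothetical, positions are 0-based (the paper numbers them from 1), and equal suffixes of different documents are ordered by document index as a stand-in for the distinct terminators.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def build_generalized_suffix_array(docs):
    """Naively build (gsa, doc_array, lcp) for a list of documents.

    gsa[r]       = (k, i): the suffix of rank r is docs[k][i:] (0-based)
    doc_array[r] = k, the document that the suffix of rank r comes from
    lcp[r]       = longest common prefix of the suffixes of ranks r-1 and r
    """
    suffixes = [(k, i) for k, doc in enumerate(docs) for i in range(len(doc))]
    # Assumption: ties between equal suffixes are broken by (k, i), which
    # plays the role of the distinct terminators appended to each document.
    suffixes.sort(key=lambda s: (docs[s[0]][s[1]:], s))
    doc_array = [k for k, _ in suffixes]
    lcp = [0] * len(suffixes)
    for r in range(1, len(suffixes)):
        (k1, i1), (k2, i2) = suffixes[r - 1], suffixes[r]
        lcp[r] = common_prefix_len(docs[k1][i1:], docs[k2][i2:])
    return suffixes, doc_array, lcp
```

For the two documents "abab" and "ba", the sorted suffixes are a, ab, abab, b, ba, bab, so the document array reads 1, 0, 0, 0, 1, 0 (0-based document indices).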

2.2 Weighted Level Ancestor Problem 

The weighted level ancestor problem, defined in [10], is a generalization of the level ancestor 
problem [6, 5] for the case when tree edges are assigned positive weights. 

Consider a rooted tree $T$ whose edges are assigned positive integer weights. For a node
$w$, let $weight(w)$ denote the total weight of the edges on the path from the root to $w$, and let
$depth(w)$ denote the usual tree depth of $w$. A weighted level ancestor query $wla(v, q)$ asks,
given a node $v$ and a positive integer $q$, for the ancestor $w$ of $v$ of minimal depth such that
$weight(w) \ge q$ ($wla(v, q)$ is undefined if there is no such node $w$).
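
To make the query concrete, here is a naive Python reference that simply walks up from $v$. It is not the data structure of this paper, only a linear-time baseline; the dictionaries `parent` and `weight` are a hypothetical tree representation, and the sketch assumes that $v$ counts as its own ancestor and that the threshold is "at least $q$".

```python
def wla(parent, weight, v, q):
    """Return the highest ancestor of v (v included) with weight >= q, or None.

    parent[u] is the parent of u (None for the root) and weight[u] is the
    total edge weight on the root-to-u path, so weights strictly decrease
    while walking towards the root.
    """
    answer = None
    while v is not None and weight[v] >= q:
        answer = v          # v still qualifies; try to move higher
        v = parent[v]
    return answer
```

On a path 0 - 1 - 2 - 3 with root-to-node weights 0, 3, 5, 9, the query `wla(parent, weight, 3, 4)` stops at node 2, since node 1 has weight 3 < 4.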

Two previously known solutions [10, 1] for the weighted level ancestor problem achieve
$O(\log\log W)$ query time using linear space, where $W$ is the total weight of all tree edges.
Our data structure also uses $O(n)$ space, but achieves a faster query time in many special
cases. We prove the following result.

Theorem 1. There exists an $O(n)$ space data structure that answers a weighted level
ancestor query $wla(v, q)$ in $O(\min(\sqrt{\log g/\log\log g}, \log\log q))$ time, where
$g = \min(depth(wla(v, q)), depth(v) - depth(wla(v, q)))$.

If every internal node is a branching node, we obtain the following corollary. 

Corollary 1. Suppose that every internal node in $T$ has at least two children. There exists
an $O(n)$ space data structure that finds $w = wla(v, q)$ in $O(\sqrt{\log n_w/\log\log n_w})$ time, where
$n_w$ is the number of leaves in the subtree of $w$.

Our approach combines the heavy path decomposition technique of [1] with efficient 
data structures for finger searching in a set of integers. Due to space limitations, the proof 
is given in the Appendix. 



3 Cross-document Pattern Counting and Reporting 



3.1 Counting 

In this section we consider the problem of counting occurrences of a pattern $T_k[i..j]$ in a
document $T_\ell$.

Our data structure consists of the generalized suffix array GSA for documents $T_1, \ldots, T_m$
and individual suffix trees $\mathcal{T}_i$ for every document $T_i$. We assume that entries of GSA and
leaves of suffix trees $\mathcal{T}_i$ are linked by pointers so that given the location of some suffix $T_k[i..]$
in GSA, we can retrieve its position in $T_k$.

For every suffix tree $\mathcal{T}_\ell$ we store a data structure of Theorem 1 supporting weighted
level ancestor queries on $\mathcal{T}_\ell$. We also augment the document array $D$ with an $O(n)$-space
data structure that answers queries $rank(k, i)$ (number of entries storing $k$ before position
$i$ in $D$) and $select(k, i)$ ($i$-th entry from the left storing $k$). Using the result of [13], we can
support such rank and select queries in $O(\log\log m)$ and $O(1)$ time respectively. Moreover,
we construct a data structure that answers range minima queries (RMQ) on the LCP array:
for any $1 \le r_1 \le r_2 \le n$, find the minimum among $LCP[r_1], \ldots, LCP[r_2]$. There exists a
linear space RMQ data structure that supports queries in constant time, see e.g., [4]. An
RMQ query on the LCP array computes the length of the longest common prefix of two
suffixes $GSA[r_1]$ and $GSA[r_2]$, denoted $LCP(r_1, r_2)$.

Our counting algorithm consists of two stages. First, using GSA, we identify a position
$p$ of $T_\ell$ at which the query pattern $T_k[i..j]$ occurs, or determine that no such $p$ exists. Then
we find the locus of $T_k[i..j]$ in the suffix tree $\mathcal{T}_\ell$ using a weighted ancestor query.

Let $r$ be the position of $T_k[i..]$ in the GSA. We find indexes $r_1 = select(\ell, rank(\ell, r))$ and
$r_2 = select(\ell, rank(\ell, r) + 1)$ in $O(\log\log m)$ time. $GSA[r_1]$ (resp. $GSA[r_2]$) is the closest
suffix from document $T_\ell$ that precedes (resp. follows) $T_k[i..]$ in the lexicographic order of
suffixes. Observe now that $T_k[i..j]$ occurs in $T_\ell$ if and only if either $LCP(r_1, r)$ or $LCP(r, r_2)$
(or both) is no less than $j - i + 1$. If this holds, then the starting position $p$ of $GSA[r_1]$
(respectively, the starting position of $GSA[r_2]$) is the position of an occurrence of $T_k[i..j]$ in $T_\ell$.
Once such a position $p$ is found, we jump to the corresponding leaf $T_\ell[p..]$ in the suffix tree of $T_\ell$.

Let $v$ be the leaf of $\mathcal{T}_\ell$ that corresponds to the suffix $T_\ell[p..]$. Then the weighted level ancestor
$u = wla(v, j - i + 1)$ is the locus of $T_k[i..j]$ in $\mathcal{T}_\ell$. This is because $T_\ell[p..p + j - i] = T_k[i..j]$.
By Corollary 1, we can find node $u$ in $O(\sqrt{\log n_u/\log\log n_u})$ time, where $n_u$ is the number
of leaf descendants of $u$. Since $u$ is the locus node of $T_k[i..j]$, $n_u$ is the number of occurrences
of $T_k[i..j]$ in $T_\ell$. By Theorem 1, we can also find $u$ in $O(\log\log(j - i + 1))$ time.
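
The first stage (finding a witness position $p$) can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the paper's data structure: the generalized suffix array is built naively, linear scans replace the rank/select structure on the document array, a `min` over a slice replaces the constant-time RMQ, positions are 0-based, and the pattern is the half-open slice `docs[k][i:j]`. All function names are hypothetical.

```python
def build_index(docs):
    """Naive generalized suffix array with LCP values and inverse ranks."""
    gsa = sorted(((k, i) for k, d in enumerate(docs) for i in range(len(d))),
                 key=lambda s: (docs[s[0]][s[1]:], s))
    lcp = [0] * len(gsa)
    for r in range(1, len(gsa)):
        a = docs[gsa[r - 1][0]][gsa[r - 1][1]:]
        b = docs[gsa[r][0]][gsa[r][1]:]
        while lcp[r] < min(len(a), len(b)) and a[lcp[r]] == b[lcp[r]]:
            lcp[r] += 1
    rank_of = {s: r for r, s in enumerate(gsa)}
    return gsa, lcp, rank_of


def find_witness(docs, index, k, i, j, l):
    """Return p with docs[l][p:p + (j - i)] == docs[k][i:j], or None."""
    gsa, lcp, rank_of = index
    m = j - i                       # pattern length
    r = rank_of[(k, i)]             # rank of the suffix docs[k][i:]

    def shared(lo, hi):
        # Common prefix of the suffixes of ranks lo <= hi; the linear-time
        # min() stands in for a constant-time RMQ on the LCP array.
        if lo == hi:
            return len(docs[gsa[lo][0]]) - gsa[lo][1]
        return min(lcp[lo + 1:hi + 1])

    # Linear scans stand in for rank/select on the document array.
    below = next((c for c in range(r, -1, -1) if gsa[c][0] == l), None)
    above = next((c for c in range(r, len(gsa)) if gsa[c][0] == l), None)
    if below is not None and shared(below, r) >= m:
        return gsa[below][1]
    if above is not None and shared(r, above) >= m:
        return gsa[above][1]
    return None
```

Only the two lexicographic neighbours from document `l` are inspected, which is why the cost of this stage does not depend on the pattern length once rank/select and RMQ are available.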

Summing up, we obtain the following theorem.

Theorem 2. For any $1 \le k, \ell \le m$ and $1 \le i \le j \le |T_k|$, we can count
the number of occurrences of $T_k[i..j]$ in $T_\ell$ in $O(t + \log\log m)$ time, where
$t = \min(\sqrt{\log occ/\log\log occ}, \log\log(j - i + 1))$ and $occ$ is the number of occurrences. The
underlying indexing structure takes $O(n)$ space and can be constructed in $O(n)$ time.

Observe that our data structure always answers counting queries in $O(\log\log n)$ time.
If $m$ and either the pattern length $(j - i + 1)$ or the number of occurrences are sufficiently
small, the query time is even better. For instance, if $m = O(1)$ and $occ = O(1)$, a query is
answered in constant time.

3.2 Reporting 

A reporting query asks for all occurrences of a substring $T_k[i..j]$ in $T_\ell$.

Compared to counting queries, we make a slight change in the data structures: instead of 
using suffix trees for individual documents $T_i$, we use suffix arrays. Similarly to the previous
section, we link each entry of GSA to a corresponding entry in the corresponding individual 
suffix array. The rest of the data structures is unchanged. 

We first find an occurrence of $T_k[i..j]$ in $T_\ell$ (if there is one) with the method described in
Section 3.1. Let $p$ be the position of this occurrence in $T_\ell$. We then jump to the corresponding
entry $r$ of the suffix array $SA_\ell$ for the document $T_\ell$. Let $LCP_\ell$ be the LCP-array of $SA_\ell$.
Starting with entry $r$, we visit adjacent entries $t$ of $SA_\ell$ moving both to the left and to the
right as long as $LCP_\ell[t] \ge j - i + 1$. While this holds, we report $SA_\ell[t]$ as an occurrence of
$T_k[i..j]$. It is easy to observe that the procedure is correct and that no occurrence is missed.
As a result, we obtain the following theorem.

Theorem 3. All the occurrences of $T_k[i..j]$ in $T_\ell$ can be reported in $O(\log\log m + occ)$ time,
where $occ$ is the number of occurrences. The underlying indexing structure takes $O(n)$ space
and can be constructed in $O(n)$ time.

Observe that the algorithm has no dependence whatsoever on the pattern length, and 
that the query time does not depend on the length of documents but only on their number. 
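
The scan around the witness entry described in this section can be sketched as follows. This is a minimal illustration with a naively built suffix array; `suffix_array_with_lcp` and `report_occurrences` are hypothetical names, positions are 0-based, and `lcp[t]` is taken to be the common prefix of the suffixes of ranks `t-1` and `t`, so the rightward scan tests `lcp[t + 1]`.

```python
def suffix_array_with_lcp(text):
    """Naive suffix array of text plus its LCP array (lcp[0] = 0)."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    lcp = [0] * len(sa)
    for t in range(1, len(sa)):
        a, b = text[sa[t - 1]:], text[sa[t]:]
        while lcp[t] < min(len(a), len(b)) and a[lcp[t]] == b[lcp[t]]:
            lcp[t] += 1
    return sa, lcp


def report_occurrences(sa, lcp, r, m):
    """Starting from rank r of a witness suffix, collect every suffix that
    shares a prefix of length at least m with it."""
    lo = r
    while lo > 0 and lcp[lo] >= m:                 # extend to the left
        lo -= 1
    hi = r
    while hi + 1 < len(sa) and lcp[hi + 1] >= m:   # extend to the right
        hi += 1
    return [sa[t] for t in range(lo, hi + 1)]
```

Each step of either loop either reports one more occurrence or stops, so the scan takes time proportional to the number of reported occurrences.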

4 Variants of the Problem 

4.1 Dynamic Counting and Reporting 

In this section we focus on a dynamic version of counting and reporting problems, where
the only dynamic operation consists in adding a document to the database (see footnote 4).

Recall that in the static case, counting occurrences of $T_k[i..j]$ in $T_\ell$ is done through the
following two steps (Section 3.1):

1. compute position $p$ of some occurrence of $T_k[i..j]$ in $T_\ell$,

2. in the suffix tree of $T_\ell$, find the locus $u$ of string $T_\ell[p..p + j - i]$, and retrieve the number
of leaves in the subtree rooted at $u$.

For reporting queries (Section 3.2), Step 1 is basically the same, while Step 2 is different
and uses an individual suffix array for $T_\ell$.

4 Document deletions can also be supported, but they require some additional constructions that are left to
the extended version of this paper.

In the dynamic framework, we follow the same general two-step scenario. Note first
that since Step 2, for both counting and reporting, uses data structures for individual
documents only, it trivially applies to the dynamic case without changes. However, Step 1
requires serious modifications that we describe below.

Since the suffix array is not well-suited for dynamic updates, at Step 1 we will use
the generalized suffix tree for $T_1, T_2, \ldots, T_m$, hereafter denoted GST. For each suffix of
$T_1, T_2, \ldots, T_m$ we store a pointer to the leaf of GST corresponding to this suffix. Unfortunately,
it is not easy to maintain the lexicographically ordered list of suffixes when GST is
dynamically updated, as it is not easy to quickly determine the location of a newly created
leaf in the list. Another task to be solved is to support updates of LCP-values and range
minima queries on these values (see footnote 5).

To this end, we introduce the following additional data structure. We maintain a dynamic
doubly-linked list corresponding to the Euler tour of the current GST. This list is
denoted by EL. Each internal node of GST is stored in two copies in EL, corresponding
respectively to the first and last visits of the node during the Euler tour. Leaves of GST are
kept in one copy. Observe that the leaves of GST appear in EL in the same "left-to-right"
order, although not consecutively.
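
As a small illustration of the list EL, the following Python sketch builds it for a static tree. It is only a static snapshot: the paper maintains EL as a dynamic doubly-linked list, whereas here a plain Python list is used, and the `children` dictionary and node labels are hypothetical.

```python
def euler_list(children, root):
    """Euler-tour list of a rooted tree: every internal node appears twice
    (first and last visit), every leaf appears once."""
    out = []

    def visit(u):
        kids = children.get(u, [])
        if not kids:
            out.append((u, "leaf"))
            return
        out.append((u, "first"))
        for c in kids:          # children are assumed to be stored left to right
            visit(c)
        out.append((u, "last"))

    visit(root)
    return out
```

Filtering the result for entries tagged `"leaf"` gives the leaves in their left-to-right order, which is the property used in the rest of this section.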

On EL, we maintain the data structure of [3] which allows, given two list elements, to
determine their order in the list in $O(1)$ time (see also [9]). Insertions of elements in the list
are supported in $O(1)$ time too.

Furthermore, we maintain a balanced tree, denoted BT, whose leaves are elements of 
EL. Note that the size of EL is bounded by 2n (n is the size of GST) and the height of BT 
is O(logn). Since the leaves of GST are a subset of the leaves of BT, we call them suffix 
leaves to avoid the ambiguity. 

Each internal node u of BT stores two kinds of information: (i) the rightmost and 
leftmost suffix leaves in the subtree of BT rooted at u, (ii) minimal LCP value among all 
suffix leaves in the subtree of BT rooted at u. 

Finally, we will also need an individual suffix array for each inserted document $T_i$.

We are now in a position to describe the algorithm of Step 1. As in the static case, we
first retrieve the leaf of GST corresponding to suffix $T_k[i..]$. To identify a position of an
occurrence of $T_k[i..j]$ in $T_\ell$, we have to examine the two closest elements in the list of leaves
of GST, one from the right and one from the left, corresponding to suffixes of $T_\ell$. To find these two
suffixes, we perform a binary search on the suffix array for $T_\ell$ using order queries of [3] on
EL. This step takes $O(\log |T_\ell|)$ time.

5 Supporting dynamic RMQ could be done with the general method of [8]; however, we give here a
simpler ad hoc algorithm with the same time complexity, which is sufficient for our purposes.

We then check if at least one of these two suffixes corresponds to an occurrence of $T_k[i..j]$
in $T_\ell$. In a similar way to Section 3, we have to compute the longest common prefix between
each of these two suffixes and $T_k[i..]$, and compare this value with $(j - i + 1)$. This amounts
to computing the minimal LCP value among all the suffixes of the corresponding range,
i.e. to answering a range-minima query. To do this, we resort to the list EL and the tree
BT and use the standard technique used for answering range queries: for any sublist $L'$
of EL we can identify $O(\log n)$ nodes $v_i$ of BT, so that an element $e$ belongs to $L'$ if and
only if it is a leaf descendant of some node $v_i$. We retrieve $O(\log n)$ nodes $v_i$ that cover the
relevant sublist of EL. The least among all minimal LCP values stored in nodes $v_i$ is the
minimal LCP value for the specified range of suffixes. Thus, computing the length of the
longest common prefix of two suffixes takes $O(\log n)$ time. Once a witness occurrence of
$T_k[i..j]$ in $T_\ell$ is found, Step 2 is done as explained in Sections 3.1 and 3.2.
The query time bounds are summarized in the following lemma. 

Lemma 1. Using the above data structures, counting and reporting all occurrences of
$T_k[i..j]$ in $T_\ell$ can be done respectively in time $O(\log n)$ and time $O(\log n + occ)$, where
$occ$ is the number of reported occurrences.
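
The range-minimum technique used in Step 1 (covering a range by $O(\log n)$ nodes of a balanced tree, each storing the minimum of its subtree) can be sketched with an array-based tree. This is an assumption-laden stand-in for BT: it works over a fixed array of LCP values and supports value updates, but not the insertions into EL and the rebalancing that the dynamic structure needs. The class name `MinTree` is hypothetical.

```python
class MinTree:
    """Complete binary tree over an array; each internal node stores the
    minimum of the leaves below it."""

    def __init__(self, values):
        self.size = 1
        while self.size < len(values):
            self.size *= 2
        self.t = [float("inf")] * (2 * self.size)
        self.t[self.size:self.size + len(values)] = values
        for i in range(self.size - 1, 0, -1):
            self.t[i] = min(self.t[2 * i], self.t[2 * i + 1])

    def update(self, pos, value):
        """Change one leaf and refresh the minima on its path to the root."""
        i = pos + self.size
        self.t[i] = value
        i //= 2
        while i >= 1:
            self.t[i] = min(self.t[2 * i], self.t[2 * i + 1])
            i //= 2

    def range_min(self, lo, hi):
        """Minimum of positions lo..hi (inclusive), combining O(log n) nodes."""
        res = float("inf")
        lo += self.size
        hi += self.size + 1
        while lo < hi:
            if lo & 1:
                res = min(res, self.t[lo])
                lo += 1
            if hi & 1:
                hi -= 1
                res = min(res, self.t[hi])
            lo //= 2
            hi //= 2
        return res
```

Both `update` and `range_min` touch a logarithmic number of tree nodes, matching the $O(\log n)$ bounds of the lemma above.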

We now explain how the involved data structures are updated. Suppose that we add
a new document $T_{m+1}$. Extending the generalized suffix tree by $T_{m+1}$ is done in time
$O(|T_{m+1}|)$ by McCreight's or Ukkonen's algorithm, i.e. in $O(1)$ amortized time per symbol.

When a new node v is added to a suffix tree, the following updates should be done (in 
order): 

(i) insert v at the right place of the list EL (in two copies if v is an internal node), 

(ii) rebalance the tree BT if needed, 

(iii) if v is a leaf of GST (i.e. a suffix leaf of BT), update LCP values and rightmost/leftmost
suffix leaf information in BT.

To see how update (i) works, we have to recall how the suffix tree is updated when a new
document is inserted. Two possible updates are the creation of a new internal node v by splitting
an edge into two (edge subdivision) and the creation of a new leaf u as a child of an existing node.
In the first case, we insert the first copy of v right after the first copy of its parent, and the
second copy right before the second copy of its parent. In the second case, the parent of u
already has at least one child, and we insert u either right after the second (or the only)
copy of its left sibling, or right before the first (or the only) copy of its right sibling.

Rebalancing the tree BT (update (ii)) is done using standard methods. Observe that
during the rebalancing we may have to adjust the LCP and rightmost/leftmost suffix leaf
information for internal nodes, but this is easy to do as only a constant number of local
modifications is done at each level.

Update (iii) is triggered when a new leaf u is created in GST and added to EL. First
of all, we have to compute the LCP value for u and possibly to update the LCP value of
the next suffix leaf u' to the right of u in EL. This is done in $O(1)$ time as follows. At the
moment when u is created, we memorize the string depth of its parent D = d(parent(u)).
Recall that the parent of u already has at least one child before u is created. If u is neither
the leftmost nor the rightmost child of its parent, then we set LCP(u) = D and LCP(u')
remains unchanged (actually it also equals D). If u is the leftmost child of its parent, then
we set LCP(u) = LCP(u') and then LCP(u') = D. Finally, if u is the rightmost child,
then LCP(u) = D and LCP(u') remains unchanged.

We then have to follow the path in BT from the new leaf u to the root and possibly
update the LCP and rightmost/leftmost suffix leaf information for all nodes on this path.
These updates are straightforward. Furthermore, during this traversal we also identify suffix
leaf u' (as the leftmost child of the first right sibling encountered during the traversal),
update its LCP value and, if necessary, the LCP values on the path from u' to the root of
BT. All these steps take time $O(\log n)$.

Thus, updates of all involved data structures take $O(\log n)$ time per symbol. The
following theorem summarizes the results of this section.

Theorem 4. In the case when documents can be added dynamically, the number of
occurrences of $T_k[i..j]$ in $T_\ell$ can be computed in time $O(\log n)$ and reporting these occurrences can
be done in time $O(\log n + occ)$, where $occ$ is their number. The underlying data structure
occupies $O(n)$ space and an update takes $O(\log n)$ time per character.

4.2 Document Counting and Reporting 

Consider a static collection of documents $T_1, \ldots, T_m$. In this section we focus on document
reporting and counting queries: report or count the documents which contain at least one
occurrence of $T_k[i..j]$, for some $1 \le k \le m$ and $i \le j$.

For both counting and reporting, we use the generalized suffix tree, generalized suffix
array and the document array $D$ for $T_1, T_2, \ldots, T_m$. We first retrieve the leaf of the
generalized suffix tree labelled by $T_k[i..]$ and compute its highest ancestor $u$ of string depth
at least $j - i + 1$, using the weighted level ancestor technique of Section 2.2. The suffixes
of $T_1, T_2, \ldots, T_m$ starting with $T_k[i..j]$ (i.e. occurrences of $T_k[i..j]$) correspond then to the
leaves of the subtree rooted at $u$, and vice versa. As shown in Section 3.1, this step takes
$O(t)$ time, where $t = \min(\sqrt{\log occ/\log\log occ}, \log\log(j - i + 1))$ and $occ$ is the number of
occurrences of $T_k[i..j]$ (this time in all documents).

Once $u$ has been computed, we retrieve the interval $[left(u)..right(u)]$ of ranks of all
the leaves of interest. We are then left with the problem of counting/reporting distinct
values in $D[left(u)..right(u)]$. This problem is exactly the same as the color counting/color
reporting problem that has been studied extensively (see e.g., [12] and references therein).

For color reporting queries, we can use the solution of [15] based on an $O(n)$-space data
structure for RMQ, applied to (a transform of) the document array $D$. The pre-processing
time is $O(n)$. Each document is then reported in $O(1)$ time, i.e. all relevant documents are
reported in $O(ndocs)$ time, where $ndocs$ is their number. The whole reporting query then
takes time $O(t + ndocs)$ for $t$ defined above.
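
The color reporting step can be sketched as follows. The sketch follows the well-known RMQ-based idea for listing distinct values in a range (replace each entry of $D$ by the position of the previous entry with the same value, then repeatedly take range minima); whether this matches every detail of [15] is an assumption. The linear-time `min` stands in for a constant-time RMQ structure, indices are 0-based, and `list_documents` is a hypothetical name.

```python
def list_documents(doc_array, left, right):
    """Distinct values of doc_array[left..right] (inclusive, 0-based)."""
    # prev[i] = previous position holding the same document, or -1.
    last, prev = {}, []
    for i, d in enumerate(doc_array):
        prev.append(last.get(d, -1))
        last[d] = i

    found = []
    stack = [(left, right)]
    while stack:
        lo, hi = stack.pop()
        if lo > hi:
            continue
        # Stand-in for a constant-time range-minimum query on prev.
        pos = min(range(lo, hi + 1), key=prev.__getitem__)
        if prev[pos] < left:
            # First occurrence of this document inside [left, right].
            found.append(doc_array[pos])
            stack.append((lo, pos - 1))
            stack.append((pos + 1, hi))
        # Otherwise no position in [lo, hi] is a first occurrence.
    return found
```

Every range minimum either reports a new document or terminates a branch, so the number of RMQ calls is proportional to the number of reported documents.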

For counting, we use the solution described in [7]. The data structure requires $O(n)$
space and a color counting query takes O(logn) time. The following theorem presents a 
summary. 

Theorem 5. We can store a collection of documents $T_1, \ldots, T_m$ in a linear space data
structure, so that for any pattern $P = T_k[i..j]$ all documents that contain $P$ can
be reported and counted in $O(t + ndocs)$ and $O(\log n)$ time respectively. Here
$t = \min(\sqrt{\log occ/\log\log occ}, \log\log |P|)$, $ndocs$ is the number of documents that contain $P$
and $occ$ is the number of occurrences of $P$ in all documents.

Again, the query time either does not depend on the pattern length at all (counting) or depends
on it only doubly logarithmically (reporting, through $t$).



4.3 Compact Counting, Reporting and Document Reporting 

In this section, we show how our reporting and counting problems can be solved on succinct
data structures [16].

Reporting and Counting. Our compact solution is based on compressed suffix
arrays [14]. A compressed suffix array for a text $T$ uses $|CSA|$ bits of space and enables
us to retrieve the position of the suffix of rank $r$, the rank of a suffix $T[i..]$, and the
character $T[i]$ in time $Lookup(n)$. Different trade-offs between space usage and query time can
be achieved (see [16] for a survey).

Our data structure consists of a compressed generalized suffix array $CSA$ for $T_1, \ldots, T_m$
and compressed suffix arrays $CSA_i$ for each document $T_i$. In [17] it was shown that using
$O(n)$ extra bits, the length of the longest common prefix of any two suffixes can be computed
in $O(Lookup(n))$ time. Besides, the ranks of any two suffixes $T_k[s..]$ and $T_\ell[p..]$ can be
compared in $O(Lookup(n))$ time: it suffices to compare $T_\ell[p + f]$ with $T_k[s + f]$ for
$f = LCP(T_k[s..], T_\ell[p..])$.

Note that ranks of the suffixes of $T_\ell$ starting with $T_k[i..j]$ form an interval $[r_1, r_2]$. We
use a binary search on the compressed suffix array of $T_\ell$ to find $r_1$ and $r_2$. At each step of
the binary search we compare a suffix of $T_\ell$ with $T_k[i..]$. Therefore $[r_1, r_2]$ can be found in
$O(Lookup(n) \cdot \log n)$ time. The number of occurrences of $T_k[i..j]$ in $T_\ell$ is then $r_2 - r_1 + 1$.
To report the occurrences, we compute the suffixes of $T_\ell$ with ranks in the interval $[r_1, r_2]$.
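
The binary search for the rank interval can be sketched on a plain (uncompressed) suffix array. This is a simplification: in the compact setting every comparison below would be carried out through the compressed suffix arrays and the LCP structure of [17] at a cost of $O(Lookup(n))$, whereas here suffixes are compared directly as Python strings. The name `rank_interval` is hypothetical and ranks are 0-based.

```python
def rank_interval(text, sa, pattern):
    """Return (r1, r2), the inclusive range of ranks in sa whose suffixes
    start with pattern; the range is empty when r1 > r2."""
    m = len(pattern)

    def prefix(rank):
        return text[sa[rank]:sa[rank] + m]

    # First rank whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    r1 = lo
    # First rank whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return r1, lo - 1
```

With an inclusive interval, the count is `r2 - r1 + 1`, and the occurrences are `sa[r1], ..., sa[r2]`.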

Theorem 6. All occurrences of $T_k[i..j]$ in $T_\ell$ can be counted in $O(Lookup(n) \cdot \log n)$ time
and reported in $O((\log n + occ) \cdot Lookup(n))$ time, where $occ$ is the number of those. The
underlying indexing structure takes $2|CSA| + O(n + m \log \frac{n}{m})$ bits of memory.

Document Reporting. Again, we use a binary search on the generalized suffix array
to find the rank interval $[r_1, r_2]$ of suffixes that start with $T_k[i..j]$. This can be done in
$O(Lookup(n) \cdot \log n)$ time.

In [18], it was shown how to report for any $1 \le r_1 \le r_2 \le n$ all distinct documents $T_\ell$
such that at least one suffix of $T_\ell$ occurs at position $r$, $r_1 \le r \le r_2$, of the generalized suffix
array. This construction uses $O(n + m \log \frac{n}{m})$ additional bits, and all relevant documents
are reported in $O(Lookup(n) \cdot ndocs)$ time, where $ndocs$ is the number of documents that
contain $T_k[i..j]$. Summing up, we obtain the following result.

Theorem 7. All documents containing $T_k[i..j]$ can be reported in $O((\log n + ndocs) \cdot Lookup(n))$
time, where $ndocs$ is the number of those. The underlying indexing structure
takes $2|CSA| + O(n + m \log \frac{n}{m})$ bits of space.
Appendix



Here we prove Theorem 1. We use the heavy path decomposition technique of [1]. 

Heavy Path Decomposition. A path $\pi$ in $T$ is heavy if every node $u$ on it has at most twice
as many nodes in its subtree as its child $v$ on $\pi$. A tree $T$ can be decomposed into paths
using the following procedure: we find the longest heavy path $\pi_r$ that starts at the root of
$T$ and remove all edges of $\pi_r$ from $T$. All remaining vertices of $T$ belong to a forest; we
recursively repeat the same procedure in every tree of that forest.

We can represent the decomposition into heavy paths using a tree $\mathbb{T}$. Each node $w_j$ in
$\mathbb{T}$ corresponds to a heavy path $\pi_j$ in $T$. A node $w_j$ is a child of a node $w_i$ in $\mathbb{T}$ if the head
of $\pi_j$ (i.e., the highest node in $\pi_j$) is a child of some node $u \in \pi_i$. Some node in $\pi_i$ has at
least twice as many descendants as each node in $\pi_j$; hence, $\mathbb{T}$ has height $O(\log n)$.
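
The decomposition procedure can be sketched in Python as follows. The sketch follows the definition given above (a path is extended to a child whose subtree contains at least half of the nodes of the current subtree); the `children` dictionary and the function names are hypothetical, and the tree of heavy paths itself is not built.

```python
def subtree_sizes(children, root):
    """Number of nodes in the subtree of every node (iterative DFS)."""
    order, stack = [], [root]
    while stack:
        u = stack.pop()
        order.append(u)
        stack.extend(children.get(u, []))
    size = {}
    for u in reversed(order):       # descendants are processed before u
        size[u] = 1 + sum(size[c] for c in children.get(u, []))
    return size


def heavy_paths(children, root):
    """Decompose the tree into heavy paths; each path is a top-down list."""
    size = subtree_sizes(children, root)
    paths, heads = [], [root]
    while heads:
        u = heads.pop()
        path = [u]
        while True:
            kids = children.get(u, [])
            # At most one child can hold at least half of the subtree of u.
            heavy = next((c for c in kids if 2 * size[c] >= size[u]), None)
            heads.extend(c for c in kids if c != heavy)
            if heavy is None:
                break
            path.append(heavy)
            u = heavy
        paths.append(path)
    return paths
```

Every node that starts a new path has a subtree less than half the size of its parent's subtree, which is the reason the tree of heavy paths has logarithmic height.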

An $O(n \log n)$ Space Solution. Let $p_j$ denote a root-to-leaf path in the tree of heavy paths $\mathbb{T}$. For a node $w$ in $\mathbb{T}$
let $weight(w)$ denote the weight of the head of $\pi$, where $\pi$ is the heavy path represented by
$w$ in $\mathbb{T}$. We store a data structure $D(p_j)$ that contains the values of $weight(v)$ for all nodes
$v \in p_j$. $D(p_j)$ contains $O(\log n)$ elements; hence, we can find the highest node $v \in p_j$ such
that $weight(v) \ge q$ in $O(1)$ time. This can be achieved by storing the weights of all nodes
from $p_j$ in the q-heap [11].

For every heavy path $\pi_j$ we store the weights of all nodes $u \in \pi_j$ in the data structure
$E(\pi_j)$; using $E(\pi_j)$, we can find for any integer $q$ the lightest node $u \in \pi_j$ such that
$weight(u) \ge q$. Using Theorem 1.5 in [2], we can find the above defined node $u \in \pi_j$
in $O(\sqrt{\log n'/\log\log n'})$ time where $n' = \min(n_h, n_l)$, $n_h = |\{v \in p_j \mid weight(v) > weight(u)\}|$,
and $n_l = |\{v \in p_j \mid weight(v) < weight(u)\}|$. Moreover, we can also find
the node $u$ in $O(\log\log q)$ time; we will describe the data structure in the full version of
this paper. Thus $E(\pi_j)$ supports queries in $O(\min(\sqrt{\log n'/\log\log n'}, \log\log q))$ time.

For each node $u \in T$ we store a pointer to the heavy path $\pi$ that contains $u$ and to the
corresponding node $w \in \mathbb{T}$.

A query $wla(v, q)$ can be answered as follows. Let $\bar{v}$ denote the node in $\mathbb{T}$ that
corresponds to the heavy path containing $v$. Let $p_j$ be an arbitrary root-to-leaf path in
$\mathbb{T}$ that also contains $\bar{v}$. Using $D(p_j)$ we can find the highest node $\bar{u} \in p_j$ such that
$weight(\bar{u}) \ge q$ in $O(1)$ time. Let $\pi_t$ denote the heavy path in $T$ that corresponds to
the parent of $\bar{u}$, and $\pi_s$ denote the path that corresponds to $\bar{u}$. If the weighted ancestor
$wla(v, q)$ is not the head of $\pi_s$, then $wla(v, q)$ belongs to the path $\pi_t$. Using $E(\pi_t)$, we
can find $u = wla(v, q)$ in $O(\min(\sqrt{\log n'/\log\log n'}, \log\log q))$ time where $n' = \min(n_h, n_l)$,
$n_h = |\{v \in p_i \mid weight(v) > weight(u)\}|$, and $n_l = |\{v \in p_i \mid weight(v) < weight(u)\}|$.

All data structures $E(\pi_i)$ use linear space. Since there are $O(n)$ leaves in $\mathbb{T}$ and each
path $p_i$ contains $O(\log n)$ nodes, all $D(p_i)$ use $O(n \log n)$ space.

Lemma 2. There exists an $O(n \log n)$ space data structure that finds the weighted level
ancestor $u$ in $O(\min(\sqrt{\log n'/\log\log n'}, \log\log q))$ time.



An $O(n)$ Space Solution. We can reduce the space from $O(n \log n)$ to $O(n)$ using a
micro-macro tree decomposition. Let $\mathcal{T}_0$ be the tree induced by the nodes of $T$ that have at least
$\log n/8$ descendants. The tree $\mathcal{T}_0$ has at most $O(n/\log n)$ leaves. We construct the data
structure described above for $\mathcal{T}_0$; since $\mathcal{T}_0$ has $O(n/\log n)$ leaves, the tree of heavy paths
$\mathbb{T}_0$ built for it also has $O(n/\log n)$ leaves. Therefore all structures $D(p_j)$ use $O(n)$ words of space.
All $E(\pi_i)$ also use $O(n)$ words of space. If we remove all nodes of $\mathcal{T}_0$ from $T$, the remaining
forest $\mathcal{F}$ consists of $O(n)$ nodes. Every tree $\mathcal{T}_i$, $i \ge 1$, in $\mathcal{F}$ consists of $O(\log n)$ nodes.
Nodes of $\mathcal{T}_i$ are stored in a data structure that uses linear space and answers weighted ancestor
queries in $O(1)$ time. This data structure will be described later in this section.

Suppose that a weighted ancestor $wla(v, q)$ should be found. If $v \in \mathcal{T}_0$, we answer the
query using the data structure for $\mathcal{T}_0$. If $v$ belongs to some $\mathcal{T}_i$ for $i \ge 1$, we check the weight
$w_r$ of $root(\mathcal{T}_i)$. If $w_r < q$, we search for $wla(v, q)$ in $\mathcal{T}_i$. Otherwise we identify the parent $v_1$ of
$root(\mathcal{T}_i)$ and find $wla(v_1, q)$ in $\mathcal{T}_0$. If $wla(v_1, q)$ in $\mathcal{T}_0$ is undefined, then $wla(v, q) = root(\mathcal{T}_i)$.

A Data Structure for a Small Tree. It remains to describe the data structure for a tree $\mathcal{T}_i$,
$i \ge 1$. Since $\mathcal{T}_i$ contains a small number of nodes, we can answer weighted level ancestor
queries on $\mathcal{T}_i$ using a look-up table $V$. $V$ contains information about any tree with up
to $\log n/8$ nodes, such that node weights are positive integers bounded by $\log n/8$. For
any such tree $\mathcal{T}$, for any node $v$ of $\mathcal{T}$, and for any integer $q \in [1, \log n/8]$, we store the
pointer to $wla(v, q)$ in $\mathcal{T}$. There are $O(2^{\log n/4})$ different trees $\mathcal{T}$ (see e.g., [5] for a simple
proof); for any $\mathcal{T}$, we can assign weights to nodes in less than $(\log n/8)!$ ways. For any
weighted tree $\mathcal{T}$ there are at most $(\log n)^2/64$ different pairs $v, q$. Hence, the table $V$
contains $O(2^{\log n/4} (\log n)^2 (\log n/8)!) = o(n)$ entries. We need only one look-up table $V$ for
all mini-trees $\mathcal{T}_i$.

We can now answer a weighted level ancestor query on $\mathcal{T}_i$ using reduction to rank
space. The rank of a node $u$ in a tree $\mathcal{T}$ is defined as
$rank(u, \mathcal{T}) = |\{v \in \mathcal{T} \mid weight(v) \le weight(u)\}|$.
The successor of an integer $q$ in a tree $\mathcal{T}$ is the lightest node $u \in \mathcal{T}$ such that
$weight(u) \ge q$. The rank $rank(q, \mathcal{T})$ of an integer $q$ is defined as the rank of its successor.
Let $rank(\mathcal{T})$ denote the tree $\mathcal{T}$ in which the weight of every node is replaced with its rank.
The weight of a node $u \in \mathcal{T}$ is not smaller than $q$ if and only if $rank(u, \mathcal{T}) \ge rank(q, \mathcal{T})$.
Therefore we can find $wla(v, q)$ in some $\mathcal{T}_i$ as follows. For every $\mathcal{T}_i$ we store a pointer to
$\bar{\mathcal{T}}_i = rank(\mathcal{T}_i)$. Given a query $wla(v, q)$, we find $rank(q, \mathcal{T}_i)$ in $O(1)$ time using a q-heap [11].
Let $v'$ be the node in $\bar{\mathcal{T}}_i$ that corresponds to the node $v$. We find $u' = wla(v', rank(q, \mathcal{T}_i))$
in $\bar{\mathcal{T}}_i$ using the table $V$. Then the node $u$ in $\mathcal{T}_i$ that corresponds to $u'$ is the weighted level
ancestor of $v$.
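
The reduction to rank space can be checked with a short Python sketch. It is an illustration only: binary search with `bisect` replaces the q-heap, a naive upward walk replaces the look-up table $V$, the rank of a node is assumed to count nodes of weight at most its own, and all names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def wla(parent, key, v, q):
    """Highest ancestor of v (v included) whose key is at least q, or None."""
    answer = None
    while v is not None and key[v] >= q:
        answer = v
        v = parent[v]
    return answer


def to_rank_space(weight):
    """Replace every weight by its rank among all node weights."""
    sorted_w = sorted(weight.values())
    rank = {u: bisect_right(sorted_w, w) for u, w in weight.items()}
    return rank, sorted_w


def rank_of_query(sorted_w, q):
    """Rank of the successor of q (lightest weight >= q), or None."""
    idx = bisect_left(sorted_w, q)
    if idx == len(sorted_w):
        return None
    return bisect_right(sorted_w, sorted_w[idx])


def wla_via_ranks(parent, weight, v, q):
    """Answer wla(v, q) on the rank-space copy of the tree."""
    rank, sorted_w = to_rank_space(weight)
    rq = rank_of_query(sorted_w, q)
    if rq is None:
        return None
    return wla(parent, rank, v, rq)
```

Since ranks are bounded by the number of nodes of the mini-tree, the rank-space query falls within the range covered by a table indexed by small integers, whatever the original weights are.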



